Iteratively Extending Time Horizon Reinforcement Learning

Authors

  • Damien Ernst
  • Pierre Geurts
  • Louis Wehenkel
Abstract

Reinforcement learning aims to determine an (infinite time horizon) optimal control policy from interaction with a system. It can be solved by approximating the so-called Q-function from a sample of four-tuples (x_t, u_t, r_t, x_{t+1}), where x_t denotes the system state at time t, u_t the control action taken, r_t the instantaneous reward obtained and x_{t+1} the successor state of the system, and by determining the optimal control from the Q-function. Classical reinforcement learning algorithms use an ad hoc version of stochastic approximation which iterates over the Q-function approximations on a four-tuple by four-tuple basis. In this paper, we reformulate this problem as a sequence of batch mode supervised learning problems which in the limit converges to (an approximation of) the Q-function. Each step of this algorithm uses the full sample of four-tuples gathered from interaction with the system and extends by one step the horizon of the optimality criterion. An advantage of this approach is to allow the use of standard batch mode supervised learning algorithms, instead of the incremental versions used up to now. In addition to a theoretical justification, the paper provides empirical tests in the context of the “Car on the Hill” control problem based on the use of ensembles of regression trees. The resulting algorithm is in principle able to handle large scale reinforcement learning problems efficiently.
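
As an illustration of the scheme sketched in the abstract, the sketch below implements the iterative horizon-extension idea over a fixed sample of four-tuples, with scikit-learn's ExtraTreesRegressor standing in for the ensemble of regression trees. The discrete action set, discount factor and all function names are assumptions made for this example, not details taken from the paper.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def fitted_q_iteration(F, actions, gamma=0.95, n_iterations=50):
    """Approximate the Q-function from a fixed sample of four-tuples,
    extending the optimization horizon by one step per iteration.

    F       : list of four-tuples (x_t, u_t, r_t, x_t1), states as 1-D arrays
    actions : finite set of candidate (scalar) control actions
    """
    X = np.array([np.append(x, u) for (x, u, r, x1) in F])   # inputs (x_t, u_t)
    r = np.array([t[2] for t in F])                          # rewards r_t
    X_next = np.array([t[3] for t in F])                     # successors x_{t+1}

    q_model = None
    for _ in range(n_iterations):
        if q_model is None:
            targets = r      # horizon 1: Q_1(x, u) is the instantaneous reward
        else:
            # Q_N(x_t, u_t) = r_t + gamma * max_u Q_{N-1}(x_{t+1}, u)
            q_next = np.column_stack([
                q_model.predict(np.column_stack([X_next, np.full(len(F), u)]))
                for u in actions
            ])
            targets = r + gamma * q_next.max(axis=1)
        q_model = ExtraTreesRegressor(n_estimators=50).fit(X, targets)
    return q_model
```

Once the final regressor is fitted, a (near-)optimal control for a state x is the action u maximizing the predicted value q_model.predict(np.append(x, u).reshape(1, -1)).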


Similar resources

Reinforcement Learning with Time

This paper steps back from the standard infinite horizon formulation of reinforcement learning problems to consider the simpler case of finite horizon problems. Although finite horizon problems may be solved using infinite horizon learning algorithms by recasting the problem as an infinite horizon problem over a state space extended to include time, we show that such an application of infinite ...

Full text
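
The recasting mentioned in this entry (folding the time index into the state so that finite-horizon problems can be handled by infinite-horizon machinery) can be sketched as below; the `transition` signature and wrapper name are hypothetical.

```python
def augment_state_with_time(transition, horizon):
    """Fold the time index into the state so a finite-horizon task can be fed
    to machinery written for stationary (infinite-horizon) problems.

    transition : (x, u) -> (x_next, reward)   # hypothetical signature
    horizon    : number of decision steps T
    """
    def wrapped(state, u):
        x, t = state
        if t >= horizon:                   # past the horizon: absorbing, zero reward
            return (x, t), 0.0
        x_next, reward = transition(x, u)
        return (x_next, t + 1), reward     # time advances as part of the state
    return wrapped
```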

Model Based Reinforcement Learning with Final Time Horizon Optimization

We present one of the first algorithms on model based reinforcement learning and trajectory optimization with free final time horizon. Grounded on the optimal control theory and Dynamic Programming, we derive a set of backward differential equations that propagate the value function and provide the optimal control policy and the optimal time horizon. The resulting policy generalizes previous re...

Full text
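
A discrete-time stand-in for the backward propagation described in this entry is sketched below: the backward differential equations are replaced by a backward dynamic-programming recursion, and the free final time is handled by comparing values across horizons. The tabular representation (arrays P and R) is an assumption for illustration, not the paper's formulation.

```python
import numpy as np

def backward_values_free_horizon(P, R, terminal_value, max_T, gamma=1.0):
    """Propagate values backward for horizons 1..max_T and, for each start
    state, pick the horizon that yields the highest value (free final time).

    P : (A, S, S) transition probabilities, R : (A, S) immediate rewards,
    terminal_value : (S,) value of stopping immediately.
    """
    values = {0: terminal_value.copy()}
    for T in range(1, max_T + 1):
        Q = R + gamma * (P @ values[T - 1])    # Q[a, s] under a T-step horizon
        values[T] = Q.max(axis=0)
    # optimal horizon per start state: the T whose value dominates
    opt_T = np.array([max(range(max_T + 1), key=lambda T: values[T][s])
                      for s in range(P.shape[1])])
    return values, opt_T
```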

Solving for Best Responses and Equilibria in Extensive-Form Games with Reinforcement Learning Methods

We present a framework to solve for best responses and equilibria in an extensive-form game (EFG) of imperfect information by transforming the game into a set of Markov decision processes (MDPs), and then applying simulation-based reinforcement learning to those MDPs. More specifically, we first transform a turn-taking partially observable Markov game (TT-POMG) into a set (one per player) of pa...

Full text
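
The transformation this entry describes (fixing the other player's strategy so that the remaining player faces an ordinary single-agent decision problem) can be illustrated on a small turn-taking game tree. The node encoding below is an illustrative assumption, and the best-response value is computed exactly by recursion rather than by the simulation-based reinforcement learning the paper applies.

```python
def best_response_value(node, opponent_policy, me="p1"):
    """Value of player `me`'s best response when the opponent's strategy is
    fixed, i.e. the game collapses into a single-agent decision problem.

    node (hypothetical encoding): ("leaf", payoff) or (player, {action: child}).
    opponent_policy : node -> {action: probability} for the opponent's turns.
    """
    if node[0] == "leaf":
        return node[1]
    player, children = node
    values = {a: best_response_value(c, opponent_policy, me)
              for a, c in children.items()}
    if player == me:
        return max(values.values())        # our turn: choose the best action
    probs = opponent_policy(node)          # opponent's turn: fixed behaviour
    return sum(probs[a] * values[a] for a in children)
```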

Truncated Horizon Policy Search: Combining Reinforcement Learning & Imitation Learning

In this paper, we propose to combine imitation and reinforcement learning via the idea of reward shaping using an oracle. We study the effectiveness of the near-optimal cost-to-go oracle on the planning horizon and demonstrate that the cost-to-go oracle shortens the learner’s planning horizon as a function of its accuracy: a globally optimal oracle can shorten the planning horizon to one, leading t...

Full text
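
The way a cost-to-go oracle can shorten the planning horizon, as claimed in this entry, is easiest to see through potential-based reward shaping with the oracle's value estimate as the potential; the function names and signatures below are assumptions, not the paper's interface.

```python
def shape_with_cost_to_go(reward_fn, oracle_value, gamma=0.99):
    """Potential-based reward shaping with a cost-to-go oracle as the potential.

    reward_fn    : (x, u, x_next) -> r          # hypothetical signature
    oracle_value : x -> estimated optimal value from that state
    With a perfect oracle, greedily maximizing the shaped one-step reward
    already recovers an optimal action (horizon shortened to one).
    """
    def shaped(x, u, x_next):
        return reward_fn(x, u, x_next) + gamma * oracle_value(x_next) - oracle_value(x)
    return shaped
```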





Publication year: 2003